The final project consists of two parts:
Deliverables: Submit both parts as one notebook via GitHub by midnight on the due date above, along with clear instructions on how to download the datasets you used for Part II and reproduce your results. The notebook should be organized with a clear table of contents at the top (see the example in the Pylaski notebook from Day 5) and links to the parts/steps outlined. Don't forget to add your name at the top as the author of the notebook.
Use numpy to load iris.npy into a numpy matrix. Print the dataset's shape and the first 5 rows.
Output required:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# For use later
column_names = ['Id','SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm','Species']
species_encoding = {'Iris-setosa': 1, 'Iris-versicolor': 2, 'Iris-virginica': 3}
# your solution
iris_data = np.load('iris.npy') # load data
print("iris_data shape:",iris_data.shape) # Print shape
print(iris_data[:5]) # Print first five rows
The first column is the id of the sample, which isn't relevant for our purposes. Remove that column from the matrix by creating a new matrix composed of the rest of the columns.
As usual, print the shape of the resulting dataset and the first 5 rows.
Output required:
# your solution
iris_data_clean = iris_data[:,1:]
print("iris_data_clean shape:",iris_data_clean.shape) # Print shape
print(iris_data_clean[:5]) # Print first five rows
Note: Don't worry about the order in which you display the values in this section; display them in whatever order/grouping makes the most sense to you.
a) Print the means and standard deviations of each column.
Output required:
# your solution
means = np.mean(iris_data_clean,axis=0) # get means of cols
stddevs = np.std(iris_data_clean,axis=0) # get std. dev. of cols
print("means: ",means)
print("std devs: ",stddevs)
b) Print the minimum and maximum values of each column
Output required:
# your solution
minimums = np.min(iris_data_clean,axis=0) # get minimums of cols
maximums = np.max(iris_data_clean,axis=0) # get maximums of cols
print("minimums: ",minimums)
print("maximums: ",maximums)
c) Calculate the species-wise means and standard deviations.
Report these values with respect to the actual name of the species, for which you must refer to 1.1
Output required:
# your solution
iris_species_1 = iris_data_clean[iris_data_clean[:,4] == 1]
iris_species_2 = iris_data_clean[iris_data_clean[:,4] == 2]
iris_species_3 = iris_data_clean[iris_data_clean[:,4] == 3]
iris_species_1_means = np.mean(iris_species_1[:,:-1],axis=0)
iris_species_2_means = np.mean(iris_species_2[:,:-1],axis=0)
iris_species_3_means = np.mean(iris_species_3[:,:-1],axis=0)
iris_species_1_stddevs = np.std(iris_species_1[:,:-1],axis=0)
iris_species_2_stddevs = np.std(iris_species_2[:,:-1],axis=0)
iris_species_3_stddevs = np.std(iris_species_3[:,:-1],axis=0)
print("Iris-setosa Means: ",iris_species_1_means)
print("Iris-versicolor Means: ",iris_species_2_means)
print("Iris-virginica Means: ",iris_species_3_means)
print("Iris-setosa Std Devs: ",iris_species_1_stddevs)
print("Iris-versicolor Std Devs: ",iris_species_2_stddevs)
print("Iris-virginica Std Devs: ",iris_species_3_stddevs)
Use list comprehensions to generate a list of tuples for each species.
For a given species, the list will represent columns and their mean values. So, each tuple will be of the form (column_name, column_mean) and you'll have one per column. You can check your intuition using your 1.3c output
Note that the column names are listed in 1.1 and recall that you dropped the id column.
Each list will have the following format:
[(column_name, column_mean), (column_name, column_mean), ...]
hint: The enumerate function might be helpful in creating a concise comprehension
Output required:
# your solution
for species_id in range(1, len(species_encoding) + 1):
    print('\nList of Tuples for Species', species_id, ':\n')
    species_rows = iris_data_clean[iris_data_clean[:, 4] == species_id]
    col_means = np.mean(species_rows[:, :-1], axis=0)
    # Pair each feature name (skipping the dropped Id and the Species columns) with its mean
    mylist = [(name, col_means[i]) for i, name in enumerate(column_names[1:-1])]
    print(mylist)
This project is the culmination of all you’ve learned in this course! You should expect to spend 24-32 total hours on the project. Be sure to read all of the items below before starting.
There are a number of steps outlined below, but it is critical that you do not view this as an entirely linear process. Remember that the science component of data science is the creation of a hypothesis based on exploration, and the testing of that hypothesis through analysis. You may need to go through many of these steps multiple times before you arrive at meaningful hypotheses or conclusions.
I attempted to build a machine learning model using multiple linear regression to predict the price of AirBnB listings in San Diego, Los Angeles, and San Francisco based on factors such as home features, location, listing descriptions, and Google Trends search data. My primary goal was to answer whether multiple linear regression can be used to accurately predict prices across distinct cities.
1: Loading/Joining the Datasets
3: Encode Categorical Variables
8: Further Insights, Scatterplot Matrix
9: Build and Analyze 1st MLR Model
10: Build and Analyze Improved MLR Model with Engineered Features
11: Build and Analyze Final MLR Model
12: Individual City MLR Models: San Diego
13: Individual City MLR Models: Los Angeles
14: Individual City MLR Models: San Francisco
I utilized three datasets obtained through InsideAirBnB, an independent, open-source data tool that publishes datasets of AirBnB listings in areas around the globe. I focused on a few of California's most populous cities, shown below:
In addition, I used Google Trends search data to calculate a custom "demand" metric. The Google Trends search can be queried here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# Load datasets
los_angeles = pd.read_csv("la_listings_2.csv")
san_diego = pd.read_csv('sd_listings_2.csv')
san_francisco = pd.read_csv('sf_listings_2.csv')
# Load Google Trends data for SD, LA, and SF AirBnB web searches.
google_trends_data = pd.read_csv('GoogleTrends_sf_sd_la.csv')
# Remove host name (privacy)
los_angeles = los_angeles.drop("host_name",axis = 1)
san_diego = san_diego.drop("host_name",axis = 1)
san_francisco = san_francisco.drop("host_name",axis = 1)
# Remove additional unnecessary/private columns
los_angeles = los_angeles.drop(labels = ['scrape_id','last_scraped','experiences_offered','notes','transit','access','interaction','house_rules',
'thumbnail_url','medium_url','picture_url','xl_picture_url','host_id','host_url','host_about', 'neighbourhood',
'host_thumbnail_url','host_picture_url','host_verifications','host_identity_verified','smart_location','state',
'is_location_exact','security_deposit','cleaning_fee','guests_included','extra_people','calendar_updated',
'calendar_last_scraped','requires_license','license','jurisdiction_names','require_guest_profile_picture'],axis = 1)
san_diego = san_diego.drop(labels = ['scrape_id','last_scraped','experiences_offered','notes','transit','access','interaction','house_rules',
'thumbnail_url','medium_url','picture_url','xl_picture_url','host_id','host_url','host_about','neighbourhood',
'host_thumbnail_url','host_picture_url','host_verifications','host_identity_verified','smart_location','state',
'is_location_exact','security_deposit','cleaning_fee','guests_included','extra_people','calendar_updated',
'calendar_last_scraped','requires_license','license','jurisdiction_names','require_guest_profile_picture'],axis = 1)
san_francisco = san_francisco.drop(labels = ['scrape_id','last_scraped','experiences_offered','notes','transit','access','interaction','house_rules',
'thumbnail_url','medium_url','picture_url','xl_picture_url','host_id','host_url','host_about','neighbourhood',
'host_thumbnail_url','host_picture_url','host_verifications','host_identity_verified','smart_location','state',
'is_location_exact','security_deposit','cleaning_fee','guests_included','extra_people','calendar_updated',
'calendar_last_scraped','requires_license','license','jurisdiction_names','require_guest_profile_picture'],axis = 1)
# Create city variable for each dataset
los_angeles["city"] = "Los Angeles"
san_diego["city"] = "San Diego"
san_francisco["city"] = "San Francisco"
Google Trends data can be used to calculate a metric for demand by determining how many monthly searches of the following occurred over a given period of time:
I then averaged these results for each city to determine how many searches occurred per month in 2018.
Lastly, I normalized the average number of monthly searches in 2018 by dividing by the number of listings in each city. This yields a reasonable per-listing metric of demand for AirBnB rentals, which may be an indicator of price for a city's listings.
# Filter Google Trends data to 2018 only
google_trends_data = google_trends_data[google_trends_data['Month'].str.contains('2018') == True]
# Define populations of each city in millions (kept for reference; the demand score below normalizes by listing count)
sd_population = 1394928 / 1e6
la_population = 3971883 / 1e6
sf_population = 884521 / 1e6
# Determine number of listings in each city
sd_num_listings = len(san_diego)
la_num_listings = len(los_angeles)
sf_num_listings = len(san_francisco)
# Determine "demand" score by averaging 2018 search data for each city,
# normalized per listing (dividing by the number of listings in that city)
sd_demand_score = google_trends_data['san diego airbnb: (Worldwide)'].mean() / sd_num_listings
la_demand_score = google_trends_data['los angeles airbnb: (Worldwide)'].mean() / la_num_listings
sf_demand_score = google_trends_data['san francisco airbnb: (Worldwide)'].mean() / sf_num_listings
# Add "demand" score to each listings dataset
los_angeles["demand score"] = la_demand_score
san_diego["demand score"] = sd_demand_score
san_francisco["demand score"] = sf_demand_score
# Stack LA, SD, and SF datasets together
data_all_cities = pd.concat([los_angeles,san_diego,san_francisco])
data_all_cities.head(4)
The raw dataset is now compiled. There is still a lot of cleaning to be done, however...
In summary, the primary removals and modifications to the raw data were as follows:
# Remove all columns that have > 30% of values missing.
filtered_data_all_cities = data_all_cities.copy()
for column in data_all_cities.columns:
    if data_all_cities[column].isnull().sum() > len(data_all_cities) * 0.30:
        filtered_data_all_cities = filtered_data_all_cities.drop(labels=column, axis=1)
# View number of NaN in each column
filtered_data_all_cities.isna().sum()
# Remove any rows that contain NaN for columns I intend to have in this project/analysis.
filtered_data_all_cities = filtered_data_all_cities.dropna(
subset = ['review_scores_communication','review_scores_location','review_scores_value','reviews_per_month',
'review_scores_cleanliness','review_scores_checkin','review_scores_rating','review_scores_accuracy',
'bathrooms','bedrooms','beds','host_listings_count','name'
])
# Remove dollar sign ($) and "," from price column.
filtered_data_all_cities['price'] = filtered_data_all_cities['price'].str.replace(',', '', regex=False)
filtered_data_all_cities['price'] = filtered_data_all_cities['price'].str.replace('$', '', regex=False) # regex=False so '$' is treated literally
filtered_data_all_cities['price'] = filtered_data_all_cities['price'].astype('float')
# Remove rows with price <= $20
filtered_data_all_cities = filtered_data_all_cities[filtered_data_all_cities['price'] > 20]
# Remove rows that are not in the United States
filtered_data_all_cities = filtered_data_all_cities[filtered_data_all_cities['country'] == 'United States']
# Filter dataset for hosts with 3 or more reviews
filtered_data_all_cities = filtered_data_all_cities[filtered_data_all_cities["number_of_reviews"] >= 3]
filtered_data_all_cities.head()
filtered_data_all_cities.describe()
print("All data: ",data_all_cities.shape)
print('Data with >= 3 reviews: ',filtered_data_all_cities.shape)
# Visualize removed data with original data, confirm we don't lose too much data
plt.figure(figsize = [15,4]);
plt.hist(data_all_cities.number_of_reviews,bins = 100,range=[0,100]);
# Remove new listings and listings without 3 or more reviews
plt.hist(filtered_data_all_cities.number_of_reviews,bins = 100,range=[0,100],);
plt.xlabel("Number of Reviews");
plt.ylabel('Frequency');
plt.title('Histogram of Number of Reviews per Listing in San Diego, San Francisco, and Los Angeles');
plt.legend(['Original Dataset','Filtered Dataset >= 3 Reviews']);
Here, I needed to encode numeric values for certain categorical columns, such that I could utilize the values for modeling purposes.
This greatly increases the dimensions of the dataframe, but provides great value in being able to encode for features such as whether an AirBnB listing is an "entire home" or a "shared room", features that are likely to be very important in building a model for price.
# Encode numerical values for True/False variables
filtered_data_all_cities = pd.get_dummies(filtered_data_all_cities,
                                          columns = ['host_is_superhost','instant_bookable',
                                                     'host_has_profile_pic','require_guest_phone_verification'])
# Drop the redundant "_f" (False) indicator columns; the "_t" columns carry the same information
filtered_data_all_cities = filtered_data_all_cities.drop(labels = ['host_is_superhost_f','instant_bookable_f',
                                                                  'host_has_profile_pic_f',
                                                                  'require_guest_phone_verification_f'],axis = 1)
# Encode numerical values for categorical variables
filtered_data_all_cities = pd.get_dummies(filtered_data_all_cities,
                                          columns = ['room_type','property_type','cancellation_policy'])
# neighbourhood_cleansed and city are necessary for plotting so get_dummies is instructed to work on another copy of the column.
filtered_data_all_cities['neighbourhood_cleansed_copy'] = filtered_data_all_cities['neighbourhood_cleansed']
filtered_data_all_cities['city_copy'] = filtered_data_all_cities['city']
filtered_data_all_cities = pd.get_dummies(filtered_data_all_cities,columns = ['neighbourhood_cleansed_copy'])
filtered_data_all_cities = pd.get_dummies(filtered_data_all_cities,columns = ['city_copy'])
filtered_data_all_cities.head()
The listings data can be visualized by price on a map of California (shown below).
import bokeh
from bokeh.sampledata import us_states, us_counties, us_cities
from bokeh.plotting import *
from bokeh.models import LogColorMapper, ColorBar, LogTicker
from bokeh.palettes import Reds9 as palette
us_states = us_states.data.copy()
us_counties = us_counties.data.copy()
us_cities = us_cities.data.copy()
ca_counties = {
code: county for code, county in us_counties.items() if county["state"] == "ca"
}
county_xs = [county["lons"] for county in ca_counties.values()]
county_ys = [county["lats"] for county in ca_counties.values()]
output_notebook()
TOOLS = "pan,wheel_zoom,reset,hover,save,box_zoom"
# init figure
p = figure(title="AirBnB Listings by Price in San Francisco, Los Angeles, and San Diego Counties", tools = TOOLS,
x_axis_location=None, y_axis_location=None,
toolbar_location="left", plot_width=1000, plot_height=1100,
tooltips=[
("Name", "@description"), ("Price", "$@price"), ('Lon', "@x"), ('Lat', '@y'), ('Neighborhood', '@neighborhood')
])
# Draw county lines
p.patches(county_xs, county_ys, fill_color = '#1d2430', line_color = 'black')
# The scatter markers
listings_xs = filtered_data_all_cities['longitude']
listings_ys = filtered_data_all_cities['latitude']
description = filtered_data_all_cities['name']
neighborhood = filtered_data_all_cities['neighbourhood_cleansed']
price = filtered_data_all_cities['price']
data=dict(
x=listings_xs,
y=listings_ys,
description=description,
neighborhood=neighborhood,
price=price
)
palette.reverse()
color_mapper = LogColorMapper(palette=palette)
p.scatter('x', 'y', source = data,
fill_color={'field': 'price', 'transform': color_mapper},
fill_alpha=0.7, line_color="white", line_width=0.05)
# Draw Legend
color_bar = ColorBar(color_mapper=color_mapper, ticker=LogTicker(),
label_standoff=0, border_line_color=None, location=(0,0))
p.add_layout(color_bar, 'center')
p.title.align = 'center'
# show results
show(p)
A few very interesting observations can be made at the state level. Listings in San Francisco are densely clustered (given San Francisco's small footprint), Los Angeles has much more broadly scattered listings, and San Diego is somewhere in between.
All cities have pockets of very high price (dark red) that appear closely concentrated in coastal regions. This is understandable, as many visitors and tourists will choose an AirBnB close to the views and activities of the beach. From my own experience, I've often gotten questions from potential guests (particularly ones from out of state) about how close my rental is to the beach (despite it being some 20+ miles away). Los Angeles sees price hot spots in coastal areas like Hermosa Beach, Santa Monica, and Malibu; other hot spots appear in LA County's Catalina Island and notoriously wealthy Beverly Hills. San Diego sees price hot spots in areas such as La Jolla, Mission Beach, and Downtown. And San Francisco listings are generally priced higher in areas near the Bay.
We see from this data that longitude and latitude themselves won't be reliable predictors of price; they mostly indicate which city a listing is from, since SF is inherently farther west and north than LA and SD. As such, these variables will likely yield inconsequential results in a model for price, or at least be collinear with a variable for city.
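This collinearity claim can be sanity-checked numerically. Below is a minimal sketch using synthetic coordinates (not the project data) roughly centered on each city's latitude; the `is_sf` indicator is a hypothetical stand-in for a city dummy variable:

```python
import numpy as np
import pandas as pd

# Synthetic latitudes roughly centered on each city (illustrative values only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "latitude": np.concatenate([
        rng.normal(37.77, 0.03, 100),   # San Francisco area
        rng.normal(34.05, 0.15, 100),   # Los Angeles area
        rng.normal(32.72, 0.10, 100),   # San Diego area
    ]),
    "city": ["San Francisco"] * 100 + ["Los Angeles"] * 100 + ["San Diego"] * 100,
})

# Latitude is nearly a proxy for the city indicator
is_sf = (df["city"] == "San Francisco").astype(float)
corr = df["latitude"].corr(is_sf)
print(f"corr(latitude, is_SF) = {corr:.2f}")
```

With geographically separated cities, this correlation comes out close to 1, which is why raw latitude adds little information beyond a city variable in a price model.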
Let's peer into each city separately...
sd_filtered_data = filtered_data_all_cities[filtered_data_all_cities['city'] == 'San Diego']
sd_county = {
    code: county for code, county in us_counties.items() if county["name"] == "San Diego"
}
sd_county_xs = [county["lons"] for county in sd_county.values()]
sd_county_ys = [county["lats"] for county in sd_county.values()]
output_notebook()
TOOLS = "pan,wheel_zoom,reset,hover,save,box_zoom"
# init figure
p2 = figure(title="AirBnB Listings by Price in San Diego County", tools = TOOLS,
x_axis_location=None, y_axis_location=None,
x_range = [-117.4,-116.92], y_range = [32.5,33.1],
toolbar_location="left", plot_width=1000, plot_height=1100,
tooltips=[
("Name", "@description"), ("Price", "$@price"), ("Lon", "@x"), ("Lat", "@y"),("Neighborhood","@neighborhood")
])
# Draw county lines
p2.patches(sd_county_xs, sd_county_ys, fill_color = '#1d2430', line_color = 'black')
# The scatter markers
listings_xs = sd_filtered_data['longitude']
listings_ys = sd_filtered_data['latitude']
description = sd_filtered_data['name']
neighborhood = sd_filtered_data['neighbourhood_cleansed']
price = sd_filtered_data['price']
data=dict(
x=listings_xs,
y=listings_ys,
description=description,
neighborhood=neighborhood,
price=price
)
color_mapper = LogColorMapper(palette=palette)
p2.scatter('x', 'y', source = data,
fill_color={'field': 'price', 'transform': color_mapper},
fill_alpha=0.7, line_color="white", line_width=0.01, size = 5)
# Draw Legend
color_bar = ColorBar(color_mapper=color_mapper, ticker=LogTicker(),
label_standoff=0, border_line_color=None, location=(0,0))
p2.add_layout(color_bar, 'center')
p2.title.align = 'center'
# show results
show(p2)
la_filtered_data = filtered_data_all_cities[filtered_data_all_cities['city'] == 'Los Angeles']
la_county = {
    code: county for code, county in us_counties.items() if county["name"] == "Los Angeles"
}
la_county_xs = [county["lons"] for county in la_county.values()]
la_county_ys = [county["lats"] for county in la_county.values()]
output_notebook()
TOOLS = "pan,wheel_zoom,reset,hover,save,box_zoom"
# init figure
p3 = figure(title="AirBnB Listings by Price in Los Angeles County", tools = TOOLS,
x_axis_location=None, y_axis_location=None,
#x_range = [-117.4,-116.85],
y_range = [33.25,34.75],
toolbar_location="left", plot_width=1000, plot_height=1100,
tooltips=[
("Name", "@description"), ("Price", "$@price"), ("Lon", "@x"), ("Lat", "@y"),("Neighborhood","@neighborhood")
])
# Draw county lines
p3.patches(la_county_xs, la_county_ys, fill_color = '#1d2430', line_color = 'black')
# The scatter markers
listings_xs = la_filtered_data['longitude']
listings_ys = la_filtered_data['latitude']
description = la_filtered_data['name']
neighborhood = la_filtered_data['neighbourhood_cleansed']
price = la_filtered_data['price']
data=dict(
x=listings_xs,
y=listings_ys,
description=description,
neighborhood=neighborhood,
price=price
)
color_mapper = LogColorMapper(palette=palette)
p3.scatter('x', 'y', source = data,
fill_color={'field': 'price', 'transform': color_mapper},
fill_alpha=0.7, line_color="white", line_width=0.01, size = 5)
# Draw Legend
color_bar = ColorBar(color_mapper=color_mapper, ticker=LogTicker(),
label_standoff=0, border_line_color=None, location=(0,0))
p3.add_layout(color_bar, 'center')
p3.title.align = 'center'
# show results
show(p3)
sf_filtered_data = filtered_data_all_cities[filtered_data_all_cities['city'] == 'San Francisco']
sf_county = {
    code: county for code, county in us_counties.items() if county["name"] == "San Francisco"
}
sf_county_xs = [county["lons"] for county in sf_county.values()]
sf_county_ys = [county["lats"] for county in sf_county.values()]
output_notebook()
TOOLS = "pan,wheel_zoom,reset,hover,save,box_zoom"
# init figure
p4 = figure(title="AirBnB Listings by Price in San Francisco County", tools = TOOLS,
x_axis_location=None, y_axis_location=None,
x_range = [-122.52,-122.38],
y_range = [37.7,37.84],
toolbar_location="left", plot_width=1000, plot_height=1100,
tooltips=[
("Name", "@description"), ("Price", "$@price"), ("Lon", "@x"), ("Lat", "@y"),("Neighborhood","@neighborhood")
])
# Draw county lines
p4.patches(sf_county_xs, sf_county_ys, fill_color = '#1d2430', line_color = 'black')
# The scatter markers
listings_xs = sf_filtered_data['longitude']
listings_ys = sf_filtered_data['latitude']
description = sf_filtered_data['name']
neighborhood = sf_filtered_data['neighbourhood_cleansed']
price = sf_filtered_data['price']
data=dict(
x=listings_xs,
y=listings_ys,
description=description,
neighborhood = neighborhood,
price=price
)
color_mapper = LogColorMapper(palette=palette)
p4.scatter('x', 'y', source = data,
fill_color={'field': 'price', 'transform': color_mapper},
fill_alpha=0.7, line_color="white", line_width=0.01, size = 5)
# Draw Legend
color_bar = ColorBar(color_mapper=color_mapper, ticker=LogTicker(),
label_standoff=0, border_line_color=None, location=(0,0))
p4.add_layout(color_bar, 'center')
p4.title.align = 'center'
# show results
show(p4)
A good place to continue developing insights for model building is a scatterplot matrix, which can be produced using the Seaborn library. A scatterplot matrix helps visualize correlations across several variables at once, and can be combined with a distribution plot at each self-intersection to provide additional information.
Let's first identify some strong correlations prior to creating the scatterplot matrix, because creating a scatterplot matrix from all variables would be very computationally expensive...
# Identify strong correlations to price
correlations_to_price = filtered_data_all_cities.corr().price.abs().sort_values(ascending = False)
correlations_to_price
We observe that the number of bedrooms, the number of people the listing accommodates, the number of bathrooms, the number of beds, and even whether the property is a villa are important indicators of listing price (i.e., they have relatively high correlation coefficients with price).
Interestingly and unfortunately, the demand score does not appear to be a highly predictive indicator of listing price. There are several possible reasons. While Google searches for 'AirBnB San Diego' or 'AirBnB Los Angeles' can gauge general interest in rentals in an area, they do not necessarily capture the majority of bookings made through AirBnB; most bookings are likely made through the application/website itself, without a preceding Google search. Additionally, certain areas may be searched by neighborhood (e.g., "Hollywood", "Balboa Park") rather than by city, which may bias the demand score significantly. Larger land areas like Los Angeles are likely affected more by this than compact areas such as San Francisco, since a would-be renter is more likely to use a specific term closer to their desired destination. Further, other applications (HomeAway, VRBO, etc.) may have different market penetration in these areas, further biasing results.
Regardless, we can pick a handful of the best correlations with price thus far and produce a scatterplot matrix...
# Pick top 7 correlations to price for extended analysis in scatterplot matrix
top_7_price_factors = correlations_to_price.keys()[1:8] # skip first one because it is price.
top_7_price_factors = top_7_price_factors.values.tolist()
top_7_price_factors
scatterplot_matrix_data = filtered_data_all_cities[top_7_price_factors + ['city'] + ['price']] # also added city to color by city
# Create scatterplot matrix of top 7 factors
sns.set(style="ticks")
sns.pairplot(scatterplot_matrix_data, hue = 'city', diag_kind = 'kde',
plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},
height = 4);
This gives us something to work with and visually validates the correlations produced earlier. We can now begin to create a model using multiple linear regression with the scikit-learn library.
We can build an MLR model and test it by leveraging train_test_split() to create a training dataset and a testing dataset. I used 20% of my data to test, and 80% of my data to train.
# Create an initial model by defining features and target
y = filtered_data_all_cities[['price']]
list_of_all_factors = correlations_to_price.keys().tolist()[1:150] # skip index 0 (price itself); keep the top-correlated features
X = filtered_data_all_cities[list_of_all_factors]
# Partition data into Training and Testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0) # uses int seed of 0
from sklearn.linear_model import LinearRegression
# Train model
regressor = LinearRegression()
regressor.fit(X_train, y_train);
# Predict from model
y_pred = regressor.predict(X_test)
from sklearn import metrics
# Evaluate the model performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R-squared:',regressor.score(X_test,y_test)) # evaluate on the held-out test set
plt.figure(figsize=[15,15])
plt.plot(y_test,y_pred,'go',markersize = 2);
plt.xlabel('Actual Value (y_test)');
plt.ylabel('Predicted Value (y_pred)');
plt.title('Multiple Linear Regression Model Performance',fontdict={'fontsize':14});
It's clear that the model could use some improvement (its R^2 is not near 1). It performs particularly poorly for listings that are price outliers ($1000+/night) and is likely to underprice them: it has no way to identify the mega-mansions and ultra-luxurious listings from the currently available features.
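The underpricing of the expensive tail is a generic property of least-squares fits on right-skewed targets. Here is a minimal synthetic sketch (not the project data; `bedrooms` is the only hypothetical feature) showing a linear model sitting far below the $1000+ listings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed prices: log-normal noise creates a few "mega-mansion" outliers
rng = np.random.default_rng(42)
n = 2000
bedrooms = rng.integers(1, 6, n)
price = 60 * bedrooms * rng.lognormal(0, 0.6, n)

model = LinearRegression().fit(bedrooms.reshape(-1, 1), price)
pred = model.predict(bedrooms.reshape(-1, 1))

# The fit systematically under-predicts the expensive tail
tail = price > 1000
print("listings over $1000:", tail.sum())
print("mean residual (actual - predicted) in tail:", round((price[tail] - pred[tail]).mean(), 1))
```

The mean residual in the tail is large and positive (under-prediction). A common remedy is to model log(price), which compresses the tail; the approach below instead engineers features that can identify the luxury listings.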
Perhaps we can achieve better model performance if we engineer features to capture unique properties about these homes...
# Filter for expensive homes ( >$1000/night )
expensive_homes = filtered_data_all_cities[filtered_data_all_cities['price'] > 1000]
# Find most common words in the summary description of the listing
expensive_homes.summary.str.split(expand=True).stack().value_counts()
num_expensive_homes = len(expensive_homes)
print("Fraction of expensive listings with 'private' in description:",expensive_homes.summary.str.contains('private').sum() / num_expensive_homes)
print("Fraction of expensive listings with 'view' or 'views' in description:",expensive_homes.summary.str.contains('view').sum() / num_expensive_homes)
We can see that expensive listings (greater than $1000/night) commonly have words like "private" or "view"/"views" in their summaries. Makes sense! This may reflect the trends we observed when plotting listings on the map of California, since coastal listings may be more likely to use words like "view" (for example, "beach view"). "Private" is likely associated with homes that are not shared or not in high-density dwellings, again an indicator of price. We can engineer features indicating whether a summary contains "private" or "view"/"views" and use them to improve our model.
# Generate features for "private" and "view"
filtered_data_all_cities['desc_contains_private'] = np.where(filtered_data_all_cities['summary'].str.contains('private'), 1, 0)
filtered_data_all_cities['desc_contains_view'] = np.where(filtered_data_all_cities['summary'].str.contains('view'), 1, 0)
# Create new correlations to price list
correlations_to_price = filtered_data_all_cities.corr().price.abs().sort_values(ascending = False)
correlations_to_price
# Create improved model by defining features and target
y = filtered_data_all_cities[['price']]
list_of_all_factors = correlations_to_price.keys().tolist()[1:150] # skip index 0 (price itself); keep the top-correlated features
X = filtered_data_all_cities[list_of_all_factors]
# Partition data into Training and Testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0) # uses int seed of 0
from sklearn.linear_model import LinearRegression
# Train model
regressor = LinearRegression()
regressor.fit(X_train, y_train);
# Predict from model
y_pred = regressor.predict(X_test)
from sklearn import metrics
# Evaluate the model performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R-squared:',regressor.score(X_test,y_test)) # evaluate on the held-out test set
plt.figure(figsize=[15,15])
plt.plot(y_test,y_pred,'go',markersize = 2);
plt.xlabel('Actual Value (y_test)');
plt.ylabel('Predicted Value (y_pred)');
plt.title('Multiple Linear Regression Model Performance',fontdict={'fontsize':14});
This showed minimal improvement in the model (the R^2 value did not drastically improve).
I then took another approach: generating features from the listing "amenities" to improve the price prediction...
My approach was to browse the amenities, identify potential indicators of more expensive homes, and generate features from those keywords. Amenities like "heated floors" are ideal because they are likely to identify more expensive homes.
# Find features from amenities
expensive_homes.amenities.str.split(",",expand=True).stack().str.split('"',expand=True).stack().value_counts()[-150:-20]
# Perhaps we can utilize amenities like spa and fireplace to improve the model.
print('spa (fraction):',expensive_homes['amenities'].str.contains('spa').sum() / num_expensive_homes)
print('fireplace (fraction):',expensive_homes['amenities'].str.contains('fireplace').sum() / num_expensive_homes)
print('parking (fraction):',expensive_homes['amenities'].str.contains('parking').sum() / num_expensive_homes)
print('BBQ (fraction):',expensive_homes['amenities'].str.contains('BBQ').sum() / num_expensive_homes)
print('Washer (fraction):',expensive_homes['amenities'].str.contains('Washer').sum() / num_expensive_homes)
print('Hot tub (fraction):',expensive_homes['amenities'].str.contains('Hot tub').sum() / num_expensive_homes)
print('Private entrance (fraction):',expensive_homes['amenities'].str.contains('Private entrance').sum() / num_expensive_homes)
print('Wifi (fraction):',expensive_homes['amenities'].str.contains('Wifi').sum() / num_expensive_homes)
print('Coffee maker (fraction):',expensive_homes['amenities'].str.contains('Coffee maker').sum() / num_expensive_homes)
print('backyard (fraction):',expensive_homes['amenities'].str.contains('backyard').sum() / num_expensive_homes)
print('Suitable for events (fraction):',expensive_homes['amenities'].str.contains('Suitable for events').sum() / num_expensive_homes)
print('Sound system (fraction):',expensive_homes['amenities'].str.contains('Sound system').sum() / num_expensive_homes)
print('Wine cooler (fraction):',expensive_homes['amenities'].str.contains('Wine cooler').sum() / num_expensive_homes)
print('Heated floors (fraction):',expensive_homes['amenities'].str.contains('Heated floors').sum() / num_expensive_homes)
# Generate a binary indicator feature for each amenity keyword explored above
amenity_keywords = ['spa', 'fireplace', 'parking', 'BBQ', 'Washer', 'Hot tub',
                    'Private entrance', 'Wifi', 'Coffee maker', 'backyard',
                    'Suitable for events', 'Sound system', 'Wine cooler', 'Heated floors']
for kw in amenity_keywords:
    col_name = 'desc_contains_' + kw.replace(' ', '_')
    filtered_data_all_cities[col_name] = np.where(
        filtered_data_all_cities['amenities'].str.contains(kw, na=False), 1, 0)
# Create new correlations to price list
correlations_to_price = filtered_data_all_cities.corr().price.abs().sort_values(ascending = False)
correlations_to_price
# Create improved model by defining features and target
y = filtered_data_all_cities[['price']]
list_of_all_factors = correlations_to_price.keys().tolist()[1:150] # Ignores "price" target variable
X = filtered_data_all_cities[list_of_all_factors]
# Partition data into Training and Testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0) # uses int seed of 0
from sklearn.linear_model import LinearRegression
# Train model
regressor = LinearRegression()
regressor.fit(X_train, y_train);
# Predict from model
y_pred = regressor.predict(X_test)
from sklearn import metrics
# Evaluate the model performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R-squared (test):', regressor.score(X_test, y_test))
plt.figure(figsize=[15,15])
plt.plot(y_test, y_pred, 'go', markersize=2);
plt.xlabel('Actual Value (y_test)');
plt.ylabel('Predicted Value (y_pred)');
plt.title('Multiple Linear Regression Model Performance',fontdict={'fontsize':14});
The model continues to improve slowly as we add features that may elucidate the potential price of a home; however, it appears we have approached an asymptote in model performance. Perhaps we don't have enough information here to reliably predict AirBnB listing prices across all cities.
A better approach may be to look at cities individually...
san_diego_filtered_data = filtered_data_all_cities[filtered_data_all_cities['city_copy_San Diego'] == 1]
# Create new correlations to price list
correlations_to_price = san_diego_filtered_data.corr().price.abs().sort_values(ascending = False)
correlations_to_price
# Create improved model by defining features and target
y = san_diego_filtered_data[['price']]
list_of_all_factors = correlations_to_price.keys().tolist()[1:80] # Ignores "price" target variable
X = san_diego_filtered_data[list_of_all_factors]
# Partition data into Training and Testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0) # uses int seed of 0
from sklearn.linear_model import LinearRegression
# Train model
regressor = LinearRegression()
regressor.fit(X_train, y_train);
# Predict from model
y_pred = regressor.predict(X_test)
from sklearn import metrics
# Evaluate the model performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R-squared (test):', regressor.score(X_test, y_test))
plt.figure(figsize=[15,15])
plt.plot(y_test, y_pred, 'go', markersize=2);
plt.xlabel('Actual Value (y_test)');
plt.ylabel('Predicted Value (y_pred)');
plt.title('Multiple Linear Regression Model Performance: San Diego',fontdict={'fontsize':14});
Amazing! We now have an R^2 value of nearly 0.6, a drastic improvement over the model trained on the complete dataset of all cities.
los_angeles_filtered_data = filtered_data_all_cities[filtered_data_all_cities['city_copy_Los Angeles'] == 1]
# Create new correlations to price list
correlations_to_price = los_angeles_filtered_data.corr().price.abs().sort_values(ascending = False)
correlations_to_price
# Create improved model by defining features and target
y = los_angeles_filtered_data[['price']]
list_of_all_factors = correlations_to_price.keys().tolist()[1:80] # Ignores "price" target variable
X = los_angeles_filtered_data[list_of_all_factors]
# Partition data into Training and Testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0) # uses int seed of 0
from sklearn.linear_model import LinearRegression
# Train model
regressor = LinearRegression()
regressor.fit(X_train, y_train);
# Predict from model
y_pred = regressor.predict(X_test)
from sklearn import metrics
# Evaluate the model performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R-squared (test):', regressor.score(X_test, y_test))
plt.figure(figsize=[15,15])
plt.plot(y_test, y_pred, 'go', markersize=2);
plt.xlabel('Actual Value (y_test)');
plt.ylabel('Predicted Value (y_pred)');
plt.title('Multiple Linear Regression Model Performance: Los Angeles',fontdict={'fontsize':14});
We don't see an improvement when training an MLR model on Los Angeles data only.
san_francisco_filtered_data = filtered_data_all_cities[filtered_data_all_cities['city_copy_San Francisco'] == 1]
# Create new correlations to price list
correlations_to_price = san_francisco_filtered_data.corr().price.abs().sort_values(ascending = False)
correlations_to_price[86:90]
# Create improved model by defining features and target
y = san_francisco_filtered_data[['price']]
list_of_all_factors = correlations_to_price.keys().tolist()[1:80] # Ignores "price" target variable
X = san_francisco_filtered_data[list_of_all_factors]
# Partition data into Training and Testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0) # uses int seed of 0
from sklearn.linear_model import LinearRegression
# Train model
regressor = LinearRegression()
regressor.fit(X_train, y_train);
# Predict from model
y_pred = regressor.predict(X_test)
from sklearn import metrics
# Evaluate the model performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R-squared (test):', regressor.score(X_test, y_test))
plt.figure(figsize=[15,15])
plt.plot(y_test, y_pred, 'go', markersize=2);
plt.xlabel('Actual Value (y_test)');
plt.ylabel('Predicted Value (y_pred)');
plt.title('Multiple Linear Regression Model Performance: San Francisco',fontdict={'fontsize':14});
We don't see an improvement when training an MLR model on San Francisco data only.
Overall Summary:
In this project, I built an MLR model to predict AirBnB listing prices for listings in San Francisco, Los Angeles, and San Diego. I iteratively built and improved several models, reaching a decent R^2 value of approximately 0.4 on the combined dataset. When analyzing cities individually, I had more success with San Diego, achieving an R^2 value of approximately 0.6, but this could not be matched in the other cities. Overall, while this model produces a reasonable estimate of price, particularly at the lower end of the price range, owners with homes above $1,000 would not be wise to rely on it. Owners in the San Diego area, however, would receive considerable benefit from applying this model to their pricing strategy, as the mean absolute error was just $60. Additional feature engineering, as shown in this project, would continue to improve model performance.
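The claim that the model is most useful at the lower end of the price range can be checked by bucketing the test set by actual price and computing the MAE within each bucket. A minimal sketch below uses hypothetical stand-in arrays (in the notebook, the real `y_test` and `y_pred` would be used instead):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's y_test / y_pred,
# with error that grows with price (as observed in the residual plots)
rng = np.random.default_rng(0)
y_true = rng.uniform(50, 2000, size=500)
y_hat = y_true + rng.normal(0, y_true * 0.15)

df = pd.DataFrame({'actual': y_true, 'predicted': y_hat})
df['bucket'] = pd.cut(df['actual'], bins=[0, 250, 500, 1000, np.inf],
                      labels=['<$250', '$250-500', '$500-1000', '>$1000'])

# Mean absolute error within each price bucket
mae_by_bucket = (df['actual'] - df['predicted']).abs().groupby(df['bucket'], observed=False).mean()
print(mae_by_bucket)
```

If the real residuals behave as described, the MAE in the `>$1000` bucket should dwarf the MAE in the `<$250` bucket, quantifying exactly where the model stops being useful.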
This project allowed me to apply my training in Python, as well as several of the major data science packages and libraries, to a typical problem that a company like AirBnB routinely develops solutions for.
Limitations:
Estimating rental prices is a difficult challenge because the target variable depends on hundreds of variables. An MLR model is appropriate when the relationships between the target and input variables are linear. However, it is likely that price does not relate linearly to many of these variables, which limits the ability of an MLR model to produce reliable predictions. I believe this happened in my model: the error terms grew drastically at the higher end of the price range, with the linear regression model unable to properly fit the more opulent rentals. In such cases, a different statistical approach may be more appropriate.
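One such alternative is a tree-based regressor, which can capture nonlinear relationships without any linearity assumption. The sketch below illustrates the gap on synthetic data with a deliberately nonlinear target (not the AirBnB dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data whose target is a nonlinear function of the inputs
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 3))
y = X[:, 0] ** 2 + np.exp(X[:, 1] / 3) + rng.normal(0, 1, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# The forest can fit the curvature that the linear model misses
print('Linear R^2:', lin.score(X_test, y_test))
print('Forest R^2:', rf.score(X_test, y_test))
```

On the actual listing data, swapping `LinearRegression` for `RandomForestRegressor` in the existing pipeline would be a straightforward experiment, since both expose the same `fit`/`predict`/`score` interface.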
One major limitation of the dataset is that price varies over time for most AirBnB listings. Owners generally adjust their prices based on day of the week and time of year, and this information is not available in the current dataset. Accounting for this seasonality and variability in pricing would certainly allow for a more robust model, as the current model assumes that pricing is uniform for a particular listing.
Perhaps the most prohibitive limitation in this dataset is the impact an owner can have on the price of a listing based on their actual revenue desires. Most owners will try to maximize revenue, but there are many factors that determine whether the full revenue potential of a listing is reached. For example, some owners perform their own cleaning and must factor in time when evaluating the price they will place on their listing. An experienced owner, as well, may have had more time to finely tune their pricing than an inexperienced one. Over time, such an owner would gravitate towards improved yield, which may or may not be associated with higher or lower price in the listing. It is very difficult to ascertain from the data whether these factors exist and to what degree they play a role in listing price.
Areas to Extend:
This model may be improved by identifying stronger features. A future direction for this project would be to employ Principal Component Analysis to identify latent segments of the listing data. For example, PCA would be well suited to identifying homes that are actually wedding venues, or that belong to a specific region or area not captured by the current dataset's city or neighborhood values. Implementing these features, as shown in this project, would continue to improve model performance.
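As a sketch of how PCA could surface such segments, the example below projects a synthetic feature matrix containing two latent groups of "listings" onto its top principal components (the notebook would run this on the engineered feature matrix instead):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix with two latent segments of listings,
# e.g. ordinary homes vs. venue-like properties
rng = np.random.default_rng(0)
segment_a = rng.normal(0, 1, size=(200, 6))
segment_b = rng.normal(3, 1, size=(50, 6))
X = np.vstack([segment_a, segment_b])

# Standardize features, then project onto the top two components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print('Explained variance ratio:', pca.explained_variance_ratio_)
print('Projected shape:', components.shape)
```

The projected components (or cluster labels derived from them) could then be added as features to the regression, in the same way the amenity indicator columns were.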
Other areas of improvement would include: